Fix time source mismatch in client RPC path after #3268 migration#3283
Conversation
|
Hi @chenBright — could you take a look when you have time? This PR completes the time-source migration started in #3268; the bisect in the description pinpoints |
|
I think it would be more appropriate to roll back the changes to LocalityAwareLoadBalancer. |
ok |
PR apache#3268 ("Use monotonic time instead of wall time") switched LocalityAwareLoadBalancer::Weight::Update's end_time_us and LocalityAwareLoadBalancer::Describe's now to butil::cpuwide_time_us(), but every caller that supplies CallInfo::begin_time_us still uses butil::gettimeofday_us(): - Channel::CallMethod (channel.cpp:451) -> Controller::IssueRPC -> Controller::Call::begin_time_us -> SelectIn::begin_time_us -> CallInfo::begin_time_us - Controller::OnVersionedRPCReturned retry sites (controller.cpp:672, 715) call IssueRPC(gettimeofday_us()) on backup-request and regular retries The mismatched time domains make latency = end_time_us - ci.begin_time_us = cpuwide_now - wallclock_begin ~= -1.7e15 us trigger the `if (latency <= 0) { /* time skews, ignore */ return 0; }` short-circuit on every call. _time_q never accumulates samples, _avg_latency stays at 0, and locality-aware weight feedback is silently disabled. Visible downstream symptom: cold-start `list://` channels with `lb=la` and 2 backends occasionally fail RPCs with EHOSTDOWN ("Fail to select server from list://...") on retry even when one backend is healthy. Bisected reproduction in xsky/brpc fork: - 51 commit range c41e838..604dad0c (1.16.1 .. 1.17.0-rc2) - master code + LA-driven multipath probe at 2 backends, max_retry=1, repeat 500x: * commit 771de31 (one before apache#3268): 0/500 fail * commit 12fb539 (apache#3268): 25/500 fail * commit 12fb539 + revert only Weight::Update::end_time_us to gettimeofday_us: 0/500 fail This commit reverts the LA-side of apache#3268's clock change so the LB lines up with its existing callers again. Channel::CallMethod and the retry paths in Controller stay on butil::gettimeofday_us(), which preserves the wall-clock semantics of Controller::_begin_time_us / Controller::latency_us() that public users rely on. Adds test/brpc_load_balancer_unittest.cpp::la_records_latency_with_consistent_time_source which drives a series of SelectServer + Feedback cycles against LocalityAwareLoadBalancer (no Server / Channel needed) and asserts that _avg_latency reflects the elapsed time, rather than being stuck at 0 because of a time-source mismatch.
9a6dda3 to
2c285ff
Compare
|
@chenBright 谢谢 review,已按你的建议改成回退 新增的 force-pushed 到 |
|
LGTM |
这个 PR 解决什么问题
#3268("Use monotonic time instead of wall time")把
LocalityAwareLoadBalancer::Weight::Update的end_time_us与LocalityAwareLoadBalancer::Describe的now切到了butil::cpuwide_time_us(),但所有产生CallInfo::begin_time_us的调用方还在用
butil::gettimeofday_us():Channel::CallMethod(channel.cpp:451)→Controller::IssueRPC→Controller::Call::begin_time_us→SelectIn::begin_time_us→CallInfo::begin_time_usController::OnVersionedRPCReturned中的 backup-request 与普通重试入口(
controller.cpp:672, 715)都是IssueRPC(butil::gettimeofday_us())时钟域错配导致:
每个 sample 都被 time-skew 短路丢掉。
_time_q永远拿不到新条目,_avg_latency一直是 0,locality-aware 反馈机制实际被静默关掉。下游可观测的现象:冷启动场景下,
list://channel +lb=la+ 2 个backend、
max_retry=1,第一个 backend 不通时,retry 偶发返回EHOSTDOWN(Fail to select server from list://...),即使第二个backend 是健康的。
Bisect / 复现
我在内部 fork 上对 brpc 1.16.1 与 1.17.0-rc2 之间的 51 个 commit 做了
bisect(master 代码不变,只换 brpc 版本),每一步在 2-backend
list://+lb=la+max_retry=1下跑探测 RPC 重复 500 次:c41e838aRelease 1.16.1771de31eFix UAF in LatencyRecorder12fb539aUse monotonic time (#3268)b223e60bFix RDMA resource604dad0c1.17.0-rc2在
12fb539a上做单行隔离实验:只把int64_t LocalityAwareLoadBalancer::Weight::Update(const CallInfo& ci, ...) { - const int64_t end_time_us = butil::cpuwide_time_us(); + const int64_t end_time_us = butil::gettimeofday_us();撤回(保留 #3268 其余所有改动),测试就回到 0/500 PASS。这把回归精确钉
到了那一行
end_time_us切换。改动内容
按 @chenBright 的建议,把
LocalityAwareLoadBalancer一侧从 #3268 引入的 cpuwide 改动回退,让它继续使用
gettimeofday_us(),与现有调用方对齐:
src/brpc/policy/locality_aware_load_balancer.cppgettimeofday_ussrc/brpc/policy/locality_aware_load_balancer.cppWeight::Update::end_time_us回退src/brpc/policy/locality_aware_load_balancer.cppDescribe::now回退Channel::CallMethod与Controller::OnVersionedRPCReturned中那些butil::gettimeofday_us()调用都不用动;Controller::_begin_time_us、Controller::latency_us()等公开 API 的墙上时钟语义保持不变。测试
新增
test/brpc_load_balancer_unittest.cpp::la_records_latency_with_consistent_time_source,不依赖 Server / Channel 的 LB 单元测试。它驱动若干次
SelectServer+Feedback循环,begin_time_us用butil::gettimeofday_us()(与Channel::CallMethod当前行为一致),并断言
_avg_latency反映了真实的过期时间,而不是被 time-skew 短路卡在0。
如果将来有人再把
Weight::Update::end_time_us切走,或者某个CallInfo::begin_time_us的调用方漂到了别的时钟域,这个测试都会失败。不起 Server / 不依赖网络,CI 上是确定性结果(本地 ~16ms)。
la_sanity与la_selection_too_long也都仍然通过。Check List
la_records_latency_with_consistent_time_source。Controller::_begin_time_us/latency_us()保持墙上时钟语义不变。